>> Training a New Tokenizer

   If you have a language or data that is not well-supported by existing models, you might want to train a new model from scratch using a tokenizer that's tailored to your data. A tokenizer breaks down text into smaller units (like words or subwords) and is essential for training language models.

   In the Hugging Face Transformers library, you can train a new tokenizer based on an existing one using the `AutoTokenizer.train_new_from_iterator()` method. Here's how you can do that:

>> 1. Assembling a Corpus

   First, you'll need a dataset in the language or data format you're interested in. Let's use Python code as an example dataset:

python code

'''
from datasets import load_dataset

# Loading the Python part of the CodeSearchNet dataset
raw_datasets = load_dataset("code_search_net", "python")
'''

The dataset contains various pieces of information, but we are interested in the Python functions (`whole_func_string` column).

>> 2. Preparing the Data for Tokenization

   To train the tokenizer efficiently, we transform the data into an iterator of lists of texts:

python code

'''
def get_training_corpus():
    return (
        raw_datasets["train"][i : i + 1000]["whole_func_string"]
        for i in range(0, len(raw_datasets["train"]), 1000)
    )

training_corpus = get_training_corpus()
'''

This method allows processing the dataset in smaller chunks, which is memory-efficient.

>> 3. Training the New Tokenizer

   With the corpus ready, you can now train a new tokenizer using an existing one (like GPT-2's tokenizer) as a base:

python code

'''
from transformers import AutoTokenizer

# Load the existing tokenizer (GPT-2 in this case)
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Train a new tokenizer using the corpus
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)
'''

>> 4. Testing the New Tokenizer

   You can now test the new tokenizer to see how it handles the data:

python code

'''
example = '''def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b'''

tokens = tokenizer.tokenize(example)
print(tokens)
'''

The new tokenizer should be more efficient and tailored to the specifics of your data, such as handling code formatting better.

>> Key Points

   - Training a Tokenizer vs. Training a Model: Training a tokenizer is a deterministic process that builds a vocabulary based on the frequency of subwords in the dataset. It differs from model training, which is more random and involves optimizing the model's performance on tasks.

   - Efficient Data Handling: Use generators to handle large datasets without exhausting memory.

   - Existing Tokenizer as a Base: Starting with an existing tokenizer simplifies the process by keeping the same algorithm and special tokens while updating the vocabulary based on your data.